Automatic Discovery of Protein Motifs Using Genetic Programming
Abstract
Automated methods of machine learning may prove useful in discovering biologically meaningful information hidden in the rapidly growing databases of DNA sequences and protein sequences. Genetic programming is an extension of the genetic algorithm in which a population of computer programs is bred, over a series of generations, in order to solve a problem. Genetic programming is capable of evolving complicated problem-solving expressions of unspecified size and shape. Moreover, when automatically defined functions are added, genetic programming becomes capable of efficiently capturing and exploiting recurring sub-patterns. This chapter describes how genetic programming with automatically defined functions successfully evolved motifs for detecting the D-E-A-D box family of proteins and for detecting the manganese superoxide dismutase family. Both motifs were evolved without prespecifying their length. Both evolved motifs employed automatically defined functions to capture the repeated use of common subexpressions. When tested against the SWISS-PROT database of proteins, the two genetically evolved consensus motifs detect the two families either as well as, or slightly better than, the comparable human-written motifs found in the PROSITE database.

1. Introduction

The structure and functions of living organisms are primarily determined by proteins (Stryer 1995). Proteins are large polypeptide molecules composed of sequences of up to several thousand amino acid residues. All proteins are composed from the same repertoire of 20 amino acid residues (conventionally denoted by the letters A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, and Y). Subject to only a few minor qualifications, the three-dimensional location of every atom of a protein in a living organism is fully determined by its sequence (its primary structure) of amino acid residues (Anfinsen 1973).
The protein's three-dimensional structure (its tertiary structure or conformation), in turn, determines the biological function and activity of the protein within a living organism. Thus, effectively all of the information about the biological function and activity of a protein is contained (albeit deeply hidden) in its primary sequence (i.e., the linear sequence of letters over an alphabet of size 20).

SWISS-PROT is a massive, systematically collected, periodically reviewed, annotated database of protein sequences that is maintained by the University of Geneva and the European Molecular Biology Laboratory (Bairoch and Boeckmann 1991). Release 30 (October 1994) of SWISS-PROT, for example, contains 14,147,368 amino acid residues from 40,292 sequences from hundreds of different species. The Human Genome Project and other research efforts in molecular biology are rapidly increasing the number of entries in SWISS-PROT and other databases of protein sequences and genomic DNA sequences. Automated techniques (such as those of machine learning) may prove useful or necessary for analyzing this accumulating data.

Most proteins appear in many different species; however, the primary sequences of the "same" protein in two different species are often not identical. For one thing, the primary sequences of the "same" protein in two different species may differ slightly in length. Moreover, even after using an alignment algorithm (e.g., Smith and Waterman 1981) to align the related sequences of somewhat different lengths, the residues found at a particular aligned position often will still differ. The reason is that only relatively small subsequences of the overall sequence are responsible for the biological function, activity, and structure of the protein. Over millions of years, evolution has substituted dissimilar residues at non-critical positions.
Even when one locates the relatively small subsequence of the protein that is responsible for the protein's biological activity, only a few of the residues of the subsequence will prove to be identical (that is, conserved) because evolution has also substituted chemically similar residues at these critical positions.

Sometimes, amidst all the differences, it is possible to identify certain high-specificity, high-sensitivity patterns (called motifs, sites, signatures, or fingerprints) in a set of sequences for biologically similar proteins. If a motif is defined well, it will detect a biologically important common property. The residues in such motifs often prove to be directly responsible for the essential function and activity of the protein.

PROSITE is a database of biologically meaningful patterns found in protein sequences (Bairoch and Bucher 1994). Release 12 (June 1994) of PROSITE, for example, contains 1,029 different motifs. Motifs are entered in the PROSITE database after careful consideration by Amos Bairoch at the University of Geneva and his colleagues. Since the intended primary purpose of PROSITE is to detect families of proteins in computerized databases, a motif is included in PROSITE if it detects most (preferably all) sequences that have a particular biological property (i.e., has few false negatives), while detecting few (preferably zero) unrelated sequences (i.e., has few false positives).

Automated methods of machine learning may be useful in discovering biologically meaningful patterns that are hidden in the rapidly growing databases of genomic and protein sequences. Unfortunately, almost all existing methods of automated discovery require that the user specify, in advance, the size and shape of the pattern that is to be discovered. In practice, however, discovering the size and shape of the pattern may itself be the problem (or at least a major part of it).
Moreover, none of the existing methods of automated discovery have a workable analog of the idea of a reusable, parameterized subroutine or subprogram to capture and exploit repeated occurrences of regularities or sub-patterns of the problem environment.

The problem of discovering biologically meaningful patterns in databases can be rephrased as a search for an unknown-sized task-performing computer program (i.e., a composition of primitive functions and terminals). When the motif discovery problem is so rephrased, genetic programming becomes a candidate for solving this problem. Moreover, if it is also desired to reuse regularities in the problem environment, then genetic programming with automatically defined functions becomes a candidate.

Section 2 of this chapter provides background on protein databases, motifs, the D-E-A-D box family of proteins, and the manganese superoxide dismutase family. Section 3 provides background on genetic programming. Section 4 identifies the preparatory steps required to apply genetic programming to the D-E-A-D box problem. Section 5 describes the implementation of genetic programming on a parallel computer. Section 6 presents a genetically evolved consensus motif that is slightly better than the human-written motif found in the PROSITE database for detecting the D-E-A-D box family of proteins. Section 7 presents a genetically evolved consensus motif for detecting the manganese superoxide dismutase family that is as good as the human-written motif found in the PROSITE database. Section 8 states the conclusion.

2. Background on Motifs and Proteins

The D-E-A-D box family of proteins and the manganese superoxide dismutase family of proteins will be used to illustrate how genetic programming may be applied to the problem of discovering motifs in protein sequences.

2.1 The D-E-A-D Box Family of Proteins

In "Birth of the D-E-A-D box," Linder et al.
(1989) described a family of proteins (called helicases) involved in the unwinding of the double helix of the DNA molecule during the replication of DNA (Chang, Arenas, and Abelson 1990; Dorer, Christensen, and Johnson 1990; Hodgman 1988). This family of proteins gets its name from the fact that the amino acid residues D (aspartic acid), E (glutamic acid), A (alanine), and D appear, in that order, at the core of one of its biologically critical subsequences. There are 34 proteins from this family among the 40,292 proteins appearing in Release 30 of SWISS-PROT. Proteins of this family can be detected effectively (but not perfectly) by the following motif of length nine (called ATP_HELICASE_1) that was included by Amos Bairoch at the University of Geneva in the PROSITE database: [LIVM]-[LIVM]-D-E-A-D-X-[LIVM]-[LIVM].

In interpreting this expression, the first pair of square brackets indicates that the first residue of the nine is to be chosen from the set consisting of the amino acid residues L, I, V, and M. The second pair of square brackets indicates that the second residue is chosen (independently of the first) from the same set of four possibilities. Then, the third, fourth, fifth, and sixth residues must be D, E, A, and D, respectively. The X in the motif indicates that the seventh residue can be any of the 20 possible amino acid residues. The eighth and ninth residues are each chosen from the same set of four, namely L, I, V, and M.

D and E are negatively charged and hence hydrophilic (water-loving) at normal pH values. A is small, uncharged, and hydrophobic (water-hating). L (leucine), I (isoleucine), V (valine), and M (methionine) are moderately sized, uncharged, and hydrophobic. Thus, ignoring the X, this motif calls for three hydrophilic residues accompanied, on each side, by two moderately sized hydrophobic residues.
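The reading of the ATP_HELICASE_1 expression given above can be sketched in a few lines of Python. This is only an illustrative sketch, not the scanning software used with PROSITE: the helper names `prosite_to_regex` and `find_motif` are hypothetical, and the translation handles only the three kinds of element that appear in this motif (a single residue, a bracketed set, and the wildcard X).

```python
import re

# The 20 amino acid residue letters listed in the Introduction.
RESIDUES = "ACDEFGHIKLMNPQRSTVWY"

def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style expression into a Python regex.

    Hypothetical helper: supports only single residues (e.g. D),
    bracketed sets (e.g. [LIVM]), and the wildcard X.
    """
    parts = []
    for element in pattern.split("-"):
        if element == "X":
            parts.append("[" + RESIDUES + "]")  # any of the 20 residues
        else:
            parts.append(element)  # "D" and "[LIVM]" are already valid regex
    return "".join(parts)

ATP_HELICASE_1 = "[LIVM]-[LIVM]-D-E-A-D-X-[LIVM]-[LIVM]"
DEAD_BOX = re.compile(prosite_to_regex(ATP_HELICASE_1))

def find_motif(sequence: str):
    """Return the 0-based offset of the first match, or None if absent."""
    match = DEAD_BOX.search(sequence)
    return match.start() if match else None
```

For example, `find_motif("QMIVLDEADKLLSQDFV")` locates the nine-residue window VLDEADKLL, whereas a fragment in which the final D of D-E-A-D is replaced by another residue is not matched.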
The above PROSITE expression detects any of $4^4 \times 20 = 5{,}120$ different possible sequences of length nine (out of approximately $5 \times 10^{11}$ possible sequences of length nine). When SWISS-PROT is searched using the above PROSITE motif, there are 34 true positives, 14,147,333 true negatives (among the 40,292 proteins), 1 false positive, and 0 false negatives. This corresponds to a correlation coefficient (Matthews 1975), C, of 0.99.

Table 1 shows six of the 34 proteins containing the D-E-A-D box motif in the SWISS-PROT database. The second column shows the position of the start of the D-E-A-D box motif. The third column shows the three amino acid residues in the primary sequence before the onset of the motif, the nine residues of the D-E-A-D box itself (positions 4 through 12 of the subsequence shown), and the five residues following the D-E-A-D box.

Table 1. Six examples of the D-E-A-D box motif.

Protein                                           Start   Subsequence
Human Putative ATP Dependent RNA Helicase P54     244     QMIVLDEADKLLSQDFV
Rabbit Eukaryotic Initiation Factor 4A            168     KMFVLDEADEMLSRGFK
Fruit Fly Vasa Protein                            397     RFVVLDEADRMLDMGFS
C. Elegans Putative ATP-Dependent RNA Helicase    192     KFLIMDEADRILNMDFE
E. Coli ATP-Dependent RNA Helicase                155     ETLILDEADRMLDMGFA
Fruit Fly Putative ATP-Dependent RNA Helicase     303     KFLVIDEADRIMDAVFQ

The number of PROSITE expressions (composed of disjunctions such as shown above) covering exactly nine positions is $(2^{20})^9 \approx 10^{54}$. Since the length of an expression that is capable of detecting a particular family of proteins is, in actual practice, not known in advance, the search space of the motif discovery problem is considerably larger than $10^{54}$.

The question arises as to whether it is possible to use an automated machine learning technique to examine a large set of protein sequences and extract biologically meaningful motifs. Such a technique should, of course, not require advance specification of the length of the motif.
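The counts and the correlation coefficient quoted above can be checked with a short calculation. The sketch below assumes the standard form of the Matthews (1975) correlation, C = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), and the function name `matthews_correlation` is chosen here purely for illustration.

```python
import math

def matthews_correlation(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from the four confusion counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Conventionally treated as 0 when any marginal total is zero.
    return numerator / denominator if denominator else 0.0

# Sequences of length nine accepted by [LIVM]-[LIVM]-D-E-A-D-X-[LIVM]-[LIVM]:
# four positions with 4 choices each, one position with 20 choices.
sequences_matched = 4**4 * 20

# All possible sequences of length nine over the 20-letter alphabet.
all_length_nine = 20**9

# Each position may be any subset of the 20 residues, over nine positions.
expressions_length_nine = (2**20) ** 9

# Counts reported in the text for the PROSITE motif on SWISS-PROT.
c = matthews_correlation(tp=34, tn=14_147_333, fp=1, fn=0)

print(sequences_matched, all_length_nine, f"{expressions_length_nine:.2e}", round(c, 2))
```

With one false positive and no false negatives, C is governed almost entirely by the ratio 34/35, which is why it falls just short of 1.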
When this problem is rephrased as a search for an unknown-sized task-performing computer program (i.e., a composition of primitive functions and terminals), genetic programming becomes a candidate for solving this problem. Moreover, if it is also desired to capture regularities in the
Publication date: 1995